Exercises

Simple reading

The file "../data/coordinates.txt" contains a list of (x, y) value pairs. Read the values into two lists, x and y.


In [ ]:

Nontrivial reading and conversion

The file "../data/CH4.pdb" contains the coordinates of a methane molecule in PDB format. The file consists of a header followed by record lines, which contain the following fields:

record name (= ATOM), atom serial number, atom name, x-, y-, z-coordinates, occupancy and temperature factor.

For example:

ATOM      2 H                   -0.627  -0.627   0.627  0.00  0.00

Convert the file into XYZ format: the first line contains the number of atoms, the second line is a title string, and the following lines contain the atomic symbols and the x-, y-, z-coordinates, all separated by whitespace. Write the coordinates with 6 decimals:

5
Converted from PDB
C    0.000000   0.000000   0.000000
...

For now, focus only on printing the output. Writing to a file comes next.

Hints:

  • separating the conversion logic into a function may help with the readability and re-usability of your code
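
As a starting point for the hint above, here is a minimal sketch of such a conversion function for a single record line. The function name pdb_record_to_xyz is made up for illustration, and the sketch assumes that the fields can be separated on whitespace, which holds for the example record shown earlier but not necessarily for every PDB file (the format is column-based).

```python
def pdb_record_to_xyz(line):
    """Convert one PDB ATOM record into an XYZ-format line.

    Assumption: fields are whitespace-separated and appear in the order
    record name, serial number, atom name, x, y, z, occupancy, temperature factor.
    """
    fields = line.split()
    symbol = fields[2]
    # Fields 3, 4 and 5 hold the x-, y- and z-coordinates
    x, y, z = (float(value) for value in fields[3:6])
    # Six decimals per coordinate, right-aligned in 10-character columns
    return "{} {:10.6f} {:10.6f} {:10.6f}".format(symbol, x, y, z)


example = "ATOM      2 H                   -0.627  -0.627   0.627  0.00  0.00"
print(pdb_record_to_xyz(example))
```

The header lines and the atom count still need to be handled separately, for instance by only converting lines that start with ATOM.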

In [ ]:

Writing

Go ahead and edit the code above to write the output to a file.


In [ ]:

Bonus exercises

Delimiter separated values

Many data exchange formats are so-called delimiter-separated values formats. The best known of these is CSV (comma-separated values).

The format has several caveats. For example, many European locales use the comma (,) as the decimal separator and the semicolon (;) as the field separator, whereas English-language locales typically use the dot (.) as the decimal separator and the comma (,) as the field separator.

Another family of formats uses whitespace, such as space or tab characters, to separate fields.

Python's csv module supports most of these variations, and it can be a time-saving tool for those who use Python and deal with such file formats a lot.
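
To illustrate how the csv module copes with a non-default field separator, the sketch below reads a small semicolon-separated sample. The sample text and its column names are made up for illustration. Note that csv only splits the fields; converting a decimal comma into a float is done by hand here.

```python
import csv
import io

# Hypothetical sample: semicolon as field separator, comma as decimal separator
text = "length;width\n5,1;3,5\n4,9;3,0\n"

reader = csv.reader(io.StringIO(text), delimiter=";")
header = next(reader)  # the first row holds the column labels
# Replace the decimal comma with a dot before converting to float
rows = [[float(value.replace(",", ".")) for value in row] for row in reader]

print(header)
print(rows)
```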

The file "../data/iris.data" is actually in CSV format even though the file extension doesn't explicitly say so (this is common).

Read in iris.data and write out a tab-separated file "iris.tsv" using the csv module.

Hint: because the first line of the input file has labels, csv.DictReader and csv.DictWriter are a good choice.
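
The pattern suggested by the hint could look roughly like the sketch below. It works on in-memory text with made-up column names rather than on iris.data, so it only demonstrates how csv.DictReader and csv.DictWriter fit together.

```python
import csv
import io

# Hypothetical in-memory input; the column names are made up for illustration
source = io.StringIO("name,value\nfoo,1\nbar,2\n")
target = io.StringIO()

reader = csv.DictReader(source)  # takes the field names from the first line
writer = csv.DictWriter(
    target,
    fieldnames=reader.fieldnames,
    delimiter="\t",        # write tab-separated output
    lineterminator="\n",
)
writer.writeheader()
for row in reader:
    writer.writerow(row)

print(target.getvalue())
```

When working with real files instead of io.StringIO, the csv documentation recommends opening them with newline="".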


In [ ]:

Counting words

The file "../data/word_count.txt" contains a short piece of text. Determine the frequency of words in the file, i.e. how many times each word appears. Print out the ten most frequent words.

Read the file line by line and use the string method split() to separate each line into words. The frequencies are most conveniently stored in a dictionary. The dictionary method setdefault() can be useful here.

For sorting, convert the dictionary into a list of (key, value) pairs with the items() method:

words = {"foo": 1, "bar": 2}
print(list(words.items()))
[('foo', 1), ('bar', 2)]
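
The counting and sorting steps can be sketched as follows. The input lines are made up for illustration and stand in for the lines read from the file. Sorting by the second element of each pair is one possible approach, not the only one.

```python
# Hypothetical input lines standing in for the contents of a text file
lines = ["the cat and the dog", "the bird"]

counts = {}
for line in lines:
    for word in line.split():
        # setdefault returns the current count, or 0 for a word not seen before
        counts[word] = counts.setdefault(word, 0) + 1

# Sort the (word, count) pairs by count, largest first
by_frequency = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
print(by_frequency[0])
```

Slicing the sorted list, e.g. by_frequency[:10], then gives the most frequent words.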

In [ ]:

Reading FASTA sequences

FASTA is a file format for storing nucleotide or protein sequences. Each sequence consists of a header line, starting with >, followed by one or more lines containing the sequence in single-letter codes (amino acids in the example below):

>5IRE:A|PDBID|CHAIN|SEQUENCE
IRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELVTTTVSNMAEVRSYCYEASISDMASDSRCPTQGEA
YLDKQSDTQYVCKRTLVDRGWGNGCGLFGKGSLVTCAKFACSKKMTGKSIQPENLEYRIMLSVHGSQHSGMIVNDTGHET
...

The file "../data/5ire.fasta" contains sequences for multiple chains of the Zika virus. Read the sequence of chain C from the file (the chain ids are given in the header, e.g. the chain above is A).

Find out which chains contain the subsequence LDFSDL.

Hints:

  • as the sequence is given on multiple lines, you should combine all the lines of a sequence into a single string. The string method .strip(), which removes leading and trailing whitespace (including newlines), is useful here.
  • you can put the reading and conversion to a standard format into one function to create a re-usable component, and put the subsequence search into another function
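
One way to structure the two functions suggested in the hints is sketched below. The function names read_fasta and headers_containing and the sample records are hypothetical and not taken from 5ire.fasta. The sketch assumes that every sequence is preceded by a header line.

```python
def read_fasta(lines):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    sequences = {}
    header = None
    for line in lines:
        line = line.strip()  # drop the trailing newline and surrounding whitespace
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            sequences[header] = ""
        elif header is not None:
            # Continuation line: append it to the current sequence
            sequences[header] += line
    return sequences


def headers_containing(sequences, pattern):
    """Return the headers of all sequences that contain pattern."""
    return [header for header, seq in sequences.items() if pattern in seq]


# Hypothetical records, made up for illustration
sample = [
    ">demo:A|CHAIN\n",
    "AAAC\n",
    "GGTT\n",
    ">demo:B|CHAIN\n",
    "TTTT\n",
]
records = read_fasta(sample)
print(records)
print(headers_containing(records, "CGG"))
```

Because the lines are joined before searching, a subsequence that spans a line break (such as CGG in the sample) is still found. An open file object can be passed to read_fasta in place of the list, since both yield lines when iterated.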

In [ ]: